<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8" />
    <meta name="keywords" content="UNIX, data tools, head, tail, less" />
    <meta name="description" content="Lecture introducing UNIX tools for inspecting and manipulating data." />
    <title>BCB 546X -- 17 Jan 2017</title>
    <style>
      @import url(https://fonts.googleapis.com/css?family=Droid+Serif);
      @import url(https://fonts.googleapis.com/css?family=Yanone+Kaffeesatz);
      @import url(https://fonts.googleapis.com/css?family=Ubuntu+Mono:400,700,400italic);

      body {
        font-family: 'Droid Serif';
      }
      h1, h2, h3 {
        font-family: 'Yanone Kaffeesatz';
        font-weight: 400;
        margin-bottom: 0;
      }
      .remark-slide-content h1 { font-size: 3em; }
      .remark-slide-content h2 { font-size: 2em; }
      .remark-slide-content h3 { font-size: 1.6em; }
      .footnote {
        position: absolute;
        bottom: 3em;
      }
      li p { line-height: 1.25em; }
      .red { color: #fa0000; }
      .large { font-size: 2em; }
      a, a > code {
        color: rgb(249, 38, 114);
        text-decoration: none;
      }
      code {
        background: #e7e8e2;
        border-radius: 5px;
      }
      .remark-code, .remark-inline-code { font-family: 'Ubuntu Mono'; }
      .remark-code-line-highlighted     { background-color: #373832; }
      .pull-left {
        float: left;
        width: 47%;
      }
      .pull-right {
        float: right;
        width: 47%;
      }
      .pull-right ~ p {
        clear: both;
      }
      #slideshow .slide .content code {
        font-size: 0.8em;
      }
      #slideshow .slide .content pre code {
        font-size: 0.9em;
        padding: 15px;
      }
      .inverse {
        background: #272822;
        color: #777872;
        text-shadow: 0 0 20px #333;
      }
      .inverse h1, .inverse h2 {
        color: #f3f3f3;
        line-height: 0.8em;
      }
      /* Two-column layout */
      .left-column {
        color: #777;
        width: 20%;
        height: 92%;
        float: left;
      }
        .left-column h2:last-of-type, .left-column h3:last-child {
          color: #000;
        }
      .right-column {
        width: 75%;
        float: right;
        padding-top: 1em;
      }
    </style>
  </head>
  <body>
    <textarea id="source">

class: center, middle

# UNIX Data Tools
## Buffalo Chapter 7


---

# Overview

--

## In Chapter 3 we learned the basic operations within the UNIX shell:

--

* the standard output and standard error data streams

--

* how to redirect our data streams

--

* how to efficiently run a series of commands using pipes

--
	
* how to manage command processes

--

## Here, we'll learn a number of UNIX tools that will allow us to inspect and manipulate data

---

## Inspecting a data file for the first time: `head`

--

* Use the `cd` command to navigate into the `chapter-07-unix-data-tools` folder in the Buffalo online resources

--

* We can inspect a file by using the `cat` command to print its contents to the screen:

```
$ cat Mus_musculus.GRCm38.75_chr1.bed
```

--

* That's a little unwieldy. Perhaps we just want to see the first few lines of a file to see how it's formatted. Let's try:

```
$ head Mus_musculus.GRCm38.75_chr1.bed
```

--

* By default `head` prints the first 10 lines; to see fewer or more, we can specify the number of lines using the `-n` option:

```
$ head -n 3 Mus_musculus.GRCm38.75_chr1.bed
```
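
---

## Trying `head -n` on a small file

A minimal sketch, assuming a throwaway five-line file in place of the Buffalo BED file. The temporary file and its coordinates are made up for illustration:

```bash
# Hypothetical stand-in for the BED file: five tab-separated intervals.
bed=$(mktemp)
printf '1\t%s\t%s\n' 100 200 300 400 500 600 700 800 900 1000 > "$bed"

# -n sets how many lines head prints from the top of the file.
first_three=$(head -n 3 "$bed")
echo "$first_three"

# wc -l confirms how many lines came back (tr strips padding some systems add).
n_lines=$(head -n 3 "$bed" | wc -l | tr -d ' ')
echo "lines returned: $n_lines"

rm -f "$bed"
```

Piping into `wc -l` is a quick way to confirm that `head` returned the number of lines you asked for.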
---

## Inspecting a data file for the first time: `tail`

--

* Just as `head` shows the start of a file, the `tail` command lets you inspect the end of a file:

```
$ tail -n 3 Mus_musculus.GRCm38.75_chr1.bed
```
--

* `tail` can also remove the header of a file: `-n +2` starts the output at line 2, skipping the first line. This is particularly useful when concatenating files for an analysis:

```
$ tail -n +2 genotypes.txt
```
--

* And here's a handy trick for inspecting both the head and tail of a file simultaneously:

```
$ (head -n 2; tail -n 2) < Mus_musculus.GRCm38.75_chr1.bed
1	3054233	3054733
1	3054233	3054733
1	195240910	195241007
1	195240910	195241007
```
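
---

## Concatenating files with a single header

The header-stripping use of `tail -n +2` can be sketched as follows. The two genotype tables here are hypothetical, created only so the example runs on its own:

```bash
# Two hypothetical genotype tables sharing the same header line.
dir=$(mktemp -d)
printf 'id\tgenotype\nA\tT/T\nB\tC/T\n' > "$dir/batch1.txt"
printf 'id\tgenotype\nC\tC/C\n' > "$dir/batch2.txt"

# Keep the first file whole (header included), then append only the
# data rows of the second file; tail -n +2 starts output at line 2.
{
  cat "$dir/batch1.txt"
  tail -n +2 "$dir/batch2.txt"
} > "$dir/combined.txt"

combined=$(cat "$dir/combined.txt")
echo "$combined"

rm -rf "$dir"
```

The combined table should contain one header line followed by the data rows of both files.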

---

## Additional uses of `head`

--

* We can also use `head` to inspect the first few lines of output from a UNIX pipeline:

```
$ grep 'gene_id "ENSMUSG00000025907"' Mus_musculus.GRCm38.75_chr1.gtf | head -n 1
```
--

* When `head` ends a complex UNIX pipeline, it exits as soon as it has printed the requested number of lines; the upstream programs then receive a broken-pipe signal (SIGPIPE) and stop, so the pipeline does not process the rest of the input

--

* Why is this useful? This dummy pipeline may help:

```
$ grep "some_string" huge_file.txt | program1 | program2 | head -n 5
```
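
---

## Seeing `head` stop a pipeline early

A small sketch of the early-exit behavior, assuming `yes` (which prints `y` forever) as a stand-in for a huge input and `tr` as a stand-in for the hypothetical `program1` and `program2`:

```bash
# yes never stops on its own, so this pipeline could not finish
# if every stage had to run to completion.
first=$(yes | head -n 3)
echo "$first"

# The same holds with a stage in between: once head exits,
# tr and yes are stopped in turn.
upper=$(yes | tr 'y' 'Y' | head -n 2)
echo "$upper"
```

Both commands return right away even though their input is endless, which is why ending a pipeline with `head` is a cheap way to test it.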

---

## Inspecting files and pipes using `less`

--

* `less` is what is known as a "terminal pager"; it allows us to view large amounts of text in our terminal

--

* Whereas with `cat` the contents of our file flash before our eyes, with `less` we can view and scroll through the file's contents

--

* Let's observe the difference between `cat` and `less` using a file from the Buffalo Chapter 7 materials:

Try:

```
$ cat contaminated.fastq
```
--

Then try:

```
$ less contaminated.fastq
```
--

* While viewing the file in `less`, try navigating with the space bar (next page) and the `b` (previous page), `j` and `k` (down and up one line), and `g` and `G` (first and last line) keys. To exit the file, press `q`
---

## Using `less` to highlight text matches and check pipes

--

* Highlighting text matches can allow us to search for potential problems in data

--

* For example, imagine we download useful Illumina data from another study and it's not clear from the documentation whether adapter sequence has been trimmed

--

* We can search for a known 3' adapter sequence using `less`:

```
$ less contaminated.fastq

# then press / and enter AGATCGG
```

--

* `less` can also be used to check the individual components of a pipe under construction:

```
$ step1 input.txt | less
$ step1 input.txt | step2 | less
$ step1 input.txt | step2 | step3 | less
```
--

* Because `less` reads only as much input as it needs to fill the screen, the upstream commands pause once the pipe is full and resume only as you scroll, limiting computation time
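
---

## A non-interactive adapter check

As a complement to searching inside `less`, the same adapter prefix can be counted with `grep -c`. This is only a sketch: the two-read FASTQ file below is invented for illustration and is not `contaminated.fastq`:

```bash
# Hypothetical two-read FASTQ file; the second read ends in the
# adapter prefix AGATCGG.
fq=$(mktemp)
printf '@read1\nACGTACGTAC\n+\nIIIIIIIIII\n' > "$fq"
printf '@read2\nACGTAGATCGG\n+\nIIIIIIIIIII\n' >> "$fq"

# grep -c counts the lines that contain the pattern.
n_hits=$(grep -c 'AGATCGG' "$fq")
echo "lines containing the adapter prefix: $n_hits"

# To jump straight to the first match in less instead of typing / :
#   less -p AGATCGG "$fq"
rm -f "$fq"
```

A nonzero count suggests the adapter has not been trimmed; opening the file in `less` then shows where in the reads the matches fall.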

---


    </textarea>
    <script src="https://gnab.github.io/remark/downloads/remark-latest.min.js" type="text/javascript">
    </script>
    <script>
      var hljs = remark.highlighter.engine;
    </script>
    <script src="remark.language.js"></script>
    <script>
      var slideshow = remark.create({
          highlightStyle: 'monokai',
          highlightLanguage: 'tex',
          highlightLines: true
        }) ;
    </script>
    <script>
      var _gaq = _gaq || [];
      _gaq.push(['_setAccount', 'UA-44561333-1']);
      _gaq.push(['_trackPageview']);

      (function() {
        var ga = document.createElement('script');
        ga.src = 'https://ssl.google-analytics.com/ga.js';
        var s = document.scripts[0];
        s.parentNode.insertBefore(ga, s);
      }());
    </script>
  </body>
</html>
